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Abstract: Data fusion is the name given to a range of methods for combining multiple sources of evidence* This mini-review 
summarizes the use of one such class of methods for combining the rankings obtained when similarity searching is used for ligand- 
based virtual screening* Two main approaches are described: similarity fusion involves combining rankings from single searches 
based on multiple similarity measures; and group fusion involves combining rankings from multiple searches based on a single 
similarity measure* The review then focuses on the rules that are available for combining similarity rankings, and on the evidence 
that exists for the superiority of fusion-based methods over conventional similarity searching* 
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Introduction 

Virtual screening involves ranking a database of previously 
untested molecules in order of decreasing probability of biological 
activity, and is an increasingly important component of lead-discovery 
programmes in the agrochemical and pharmaceutical industries [1-4]. 
There are two main approaches: structure-based virtual screening, 
which requires knowledge of the 3D structure of the biological target; 
and ligand-based virtual screening, which requires knowledge of at 
least some ligands that exhibit the desired bioactivity. In this paper, 
we focus on similarity searching, which is arguably the simplest, and 
probably the most widely, used approach currently available for 
ligand-based virtual screening [5-9]. 

In its simplest form, similarity searching assumes the existence of 
at least one active (or potentially active) molecule, which is normally 
referred to as the reference or target structure, and a database of 
molecules that have not, thus far, been tested in the assay of interest. 
If one assumes that molecules that are structurally similar are likely to 
have similar properties, an assumption that is normally referred to as 
the similar property principle [10], then the molecules most similar to 
the reference structure are those with the greatest probabilities of 
activity, and hence prime candidates for biological testing. 

There are very many different ways in which inter-molecular 
similarities can be computed, but all measures comprise three basic 
components: the representation that characterizes each molecule; the 
weighting scheme that is used to (de)prioritise different parts of the 
representation to reflect their relative importance; and the similarity 
coefficient that provides a numeric value for the degree of similarity 
between two weighted representations. Many different types of 
representation have been reported in the literature [7, 8, II] but these 
are all of three basic types: sets of computed molecular properties 
(such as molar volume, molecular weight, numbers of heteroatoms, log 
octanol/water partition coefficient etc) yielding so-called ID 
representations; topological (or 2D) representations encoding patterns 
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of atoms and bonds; and representations that encode 3D atom 
coordinate or shape information. There have been only limited 
discussions of weighting schemes for similarity searching [12, 13] but 
many studies of the various types of representation and similarity 
coefficient that are available [7, 8, 14-17]. Combining the three 
components hence enables the creation of very large numbers of 
possible similarity measures, with several detailed comparisons 
available that seek to establish the most appropriate for chemical 
similarity searching [8, 18-20]. However, it has become widely 
recognised that no single measure can be expected to provide the best 
level of search effectiveness in all circumstances [2, 18, 21-23], with 
the result that researchers have looked for ways of combining the 
results obtained from use of multiple similarity searches. This is 
normally effected using the technique known as data fusion [24]; an 
analogous combination approach, there called consensus scoring, is 
also widely used in structure-based virtual screening [25]. 

Data fusion 

The term 'data fusion* is used to describe a range of methods for 
combining information that has been obtained in digital form from 
different sources, with the aim of producing a fused source that is 
more informative than are individual data sources [26-28]. The 
techniques are used in many different application areas [29]. When 
used for similarity searching, a data source is a similarity measure that 
calculates a similarity score for each of the structures in a database and 
then ranks the structures in decreasing order of these scores, where the 
scores (or the ranks, vide infra) are assumed to reflect the probabilities 
of each of the database structures exhibiting the same biological 
activity as the reference structure. The availability of multiple sources 
of information means that combining several different similarity 
rankings to give a single fused ranking is expected to provide a 
superior level of screening effectiveness than will the ranking obtained 
from any single similarity measure. 

The basic procedure that has been developed for similarity 
searching is shown in algorithmic form below. 
FORx:= I to n DO 

FOR 7 := I to NDO 

Calculate the similarity, SIMJ^d^), for the j/-th database- 
structure using the x-th similarity scoring function 
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FOR 7:= I to NDO 

Use a fusion rule, F, to combine the set of n scores {SIMJ^d^)} 
for the y-th database-structure to give its fused score, FSIM r 
Sort the database into decreasing order of the fused scores, FSIM r 

In this algorithm, there are n different ways for calculating the 
similarity SIMJ^d^) for each of the TV structures in the database that is 
being searched (I < x < n, and I < y < TV). The fusion rule, F, is a 
procedure that combines the set of n different similarity scores for 
each database structure, y, to a yield the final fused score, FSIM r . The 
TV fused scores, one for each database structure, are then sorted into 
decreasing order to provide the final output of the similarity search. 
The procedure is shown diagrammatically in Figures I and 2. The 
yellow shading denotes the database that is to be searched and the 
purple ovals in Figure I denote the sets of top-ranked molecules 
retrieved in three individual similarity searches, e.g., those occurring in 
the top-I% of the rankings. Some of these retrieved molecules are 
active, as denoted by the red circles. Figure 2 shows the application of 
a fusion rule to the three individual search outputs, with the resulting 
combined output, e.g., the top-I% of the fused ranking, containing a 
greater concentration of actives than do the outputs in Figure I from 
the three individual searches. 




variations in effectiveness that are often encountered in individual 
searches using conventional approaches to similarity-based screening. 
The group-fusion approach was first studied in detail by Willett et al 
at Sheffield, comparing the results obtained with those from similarity 
fusion and from conventional similarity searching [30, 35, 36]. They 
found that group fusion was notably superior to the other two 
approaches, especially when searching for structurally heterogeneous 
sets of active molecules, and group fusion has become widely used as 
standard technique for ligand-based virtual screening [7]. 
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Figure 1. Individual search outputs for three similarity searches (the 
purple ovals) of a chemical database (the yellow volume), with highly 
similar active molecules denoted by the red circles. 



The fusion procedure, as described digrammatically in the figures 
or algorithmically in the pseudo-code is completely general in nature 
and can be implemented in several different ways. 

First, one must specify the nature of the n different searches that 
are carried out [30]: in similarity fusion, n different similarity 
measures are used to search the database with a single reference 
structure; and in group fusion, n different reference structures are used 
to search the database with a single similarity measure. The similarity 
fusion approach was the first to be discussed in the late Nineties. 
Sheridan et ah at Merck described the fusion of pairs of rankings 
generated using different types of fingerprint [31, 32] while Ginn et 
al at Sheffield described the fusion of 2D, 3D and spectral rankings 
generated using different types of similarity coefficient [33, 34]. Both 
groups found that data fusion gave search results that were generally at 
least as effective as the best individual similarity searches, and that 
multiple sources of information could lessen the rather substantial 



Figure 2. Combined search output resulting from the application of a 
fusion rule to the three individual search outputs in Figure 1. 



Second, one must specify what is fused once the searches have 
been carried out. The algorithm above assumes that it is the actual 
similarities, i.e., the set of n scores {SIMJ^d^)} for the y-th database- 
structure, that are combined to give the fused score that forms the 
basis for the final ranking that is presented to the user. Alternatively, 
the fusion rule can be applied to the ranks of the TV database 
structures when all of the similarity scores are ranked in decreasing 
order [30, 34, 37]. Ranks are derived from similarities and hence 
provide less information; however, they are useful in similarity fusion 
when, as is often the case, the n different similarity measures give 
similarity scores that follow non-identical frequency distributions and 
that could hence introduce some degree of bias into the results. For 
example, if similarities are calculated using the cosine coefficient and 
the Tanimoto coefficient, which are two of the best-known and most 
widely used similarity coefficients [38], then the cosine scores will 
always be greater than the Tanimoto scores (except at the extremal 
values of zero and unity, when the two will be identical). 

Thus far, we have referred to the combination of different 
rankings, so as to produce a single output ranking, without specifying 
how the combination is achieved in practice. This is the function of a 
fusion rule, and many such rules have been reported in the literature as 
described in the following section. 

Fusion rules 

Using the notation in the algorithm above, the basic input to a 
fusion rule comprises n (n> 2) sets of TV similarities or ranks and the 
output is a ranking of the TV structures comprising the database that is 
being searched. The many fusion rules that have been discussed in the 
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literature are of two basic types: an unsupervised rule operates directly 
on the similarity or rank information, whereas a supervised rule 
requires an initial training procedure using available structure-activity 
data. In this mini-review we focus on the former class of rules since 
they have been more widely used to date; however, we shall exemplify 
the latter class by describing work on belief theory carried out by a 
group at Abbott [39]. 

Fusion is normally implemented by applying simple arithmetic 
operations on the lists of similarity scores (or ranks) resulting from 
the n searches, and these arithmetic fusion rules are reviewed in some 
detail by Chen et al [40]. The two most common examples of this 
class are the so-called MAX and SUM rules. Using the terminology 
in the algorithm shown previously, the fused score FSIM r for the 
MAX rule has the form 

Max{S/Afi(djO,S/M 2 ^ 

i.e., it assigns the y-th database-structure, d y , a score that is the largest 
of the n similarities to the reference structure that have been 
calculated; while the fused score for the SUM rule has the form 



one (all but two, all but three etc.) ranked lists, a procedure that Cross 
et al found to be superior to the SUM rule when used with rank data. 

Unsupervised fusion rules, such as those described above, require 
just the n sets of TV similarity scores (or the resulting ranks) as inputs, 
whereas the many supervised rules additionally require a quantitative 
relationship between the structural similarity of two molecules and 
their corresponding similarities in activity. Several such approaches 
have been described [45, 47-49] as exemplified by the recent study of 
Muchmore et al on data fusion using belief theory [39]. This 
involves the calculation of a degree of belief in some outcome given 
the evidence available from different sources, i.e., belief in the activity 
of a database structure given its similarities to the reference structure 
in a set of similarity searches. Muchmore et al analysed a large in- 
house file of screening data to identify the similarities, using various 
similarity measures, between pairs of molecules that had comparable 
activities, and were hence able to derive a relationship between Br, the 
belief that a pair of molecules are equally active using the x-th 
similarity measure, and SIMx, the similarity score for the x-th 
similarity measure. The rule for combining the individual beliefs for a 
given database structure in each of the n similarity searches is 



n 

Y,SIM X (dy) 

x=l 

and hence assigns d y a score that is the sum (or, equivalently, the 
arithmetic mean) of the n individual similarities. An early comparison 
of arithmetic rules for similarity fusion by Ginn et al suggested that 
the SUM rule was generally the most effective [34]. However, Hert et 
al found that the MAX rule was notably more effective for group 
fusion when similarity scores were to be fused [35, 41]. This finding 
was confirmed in a very detailed comparative study by Nasr et al that 
used over 40 public datasets [42] and the approach has now been 
widely adopted (see, e.g., [7, 43-45]). 

Although defined above in terms of similarity scores, SIMJ^d^), 
such arithmetic rules are equally applicable to the rank data, 
RANKJ^dy^j, obtained when the similarity scores are sorted into 
descending order. Chen et al describe a further rule, the reciprocal 
rank fusion (RRF) rule, that is applicable only to rank data and that 
derives from the fact that virtual screening often involves applying a 
cut-off on the similarity scores (such as the top-I%) so that only a 
small fraction of the database is considered further in a project [40]. 
Let p (p < ii) be the number of times that an individual database 
structure d y , occurs above the chosen cut-off. Then the RRF rule 
involves summing the reciprocal ranks for those p occurrences to give 
a fused score 

P 1 

I — — • 

S RANKx(d y ) 

Chen et al found that RRF out-performed all of the other rules that 
they considered in their detailed comparative study. They ascribed 
this to the close relationship they were able to demonstrate between 
the reciprocal rank of a database structure and its probability of 
activity as determined from an analysis of sets of bioactive molecules 
in the MDDR and WOMBAT databases. 

Another, more complex fusion rule has been described recently by 
Cross et al [46]. Fusion here is based on Pareto ranking, where the 
Pareto rank of each database structure is the number of structures that 
have a larger similarity score in all of the n ranked lists that are to be 
fused. Ties in this initial Pareto ranking are then resolved by 
considering the number of molecules with larger similarities in all but 



x=l 

and Muchmore et al found that this rule yielded rankings that were 
comparable to those resulting from use of the SUM rule but that were 
easier to interpret [39]. The approach has subsequently been used for 
lead-hopping [50] and for combining the results of ligand-based and 
structure-based virtual screening [51]. 

The increasing availability of large volumes of linked chemical and 
biological data means that supervised fusion rules are likely to become 
more widely used in the future; currently, unsupervised rules provide a 
simple, widely used approach to the effective combination of multiple 
search outputs. 

Why does data fusion work? 

The basic assumption in data fusion is that the availability of 
multiple sources of information (i.e., similarity rankings in the present 
context) will yield better results than when just a single source is 
available. The review by Willett [24] summarized a range of studies 
demonstrating that this does indeed seem to be the case for ligand- 
based screening: fusion-based screening is often comparable with, or 
even superior to, the best of the screening methods that are being 
combined, especially when group fusion is used; and fusion results in a 
level of screening effectiveness that is far more consistent from search 
to search than is the case when just a single similarity method is 
available. Studies in Sheffield have investigated the reasons for the 
success of data fusion, using both empirical and theoretical approaches 
[52-55]. 

Whittle et al developed and tested an analytical model of fusion- 
based similarity searching [52-54]. The study focused on the use of 
the SUM and MAX rules in similarity fusion to combine pairs of 
rankings derived from searches with different similarity coefficients, 
but Whittle et al demonstrated that their methods could be extended 
to similarity fusion with different types of fingerprint or to group 
fusion. Assume that searches are carried out using two similarity 
coefficients, such as the cosine coefficient and the Tversky coefficient, 
and that one then plots the corresponding frequency distributions for 
the similarities between the reference structure and the database 
structures. Consider the numbers of similarities that are of magnitude 
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at least x using the cosine coefficient and at least y using the Tversky 
coefficient: an effective fusion rule will then be one that preferentially 
populates this portion of the joint frequency distribution with active 
molecules (or depopulates it with inactive molecules) when compared 
with the corresponding distributions for each of the individual 
coefficients. Whittle et at demonstrate that this is the case in practice 
for the SM and MAX fusion rules if, and only if, sufficient training 
data are available, since even the fusion of just two similarity lists 
requires information about eight distinct frequency distributions. If 
some cut-off, e.g., the top-I%, is applied to each ranking then the 
following distributions must be considered: those for the top-ranked 
actives and for the top-ranked inactives above the cut-off for each 
similarity coefficient for both the database structures that occur above 
the cut-off in both lists and for those occurring in just one of the lists. 
When such data are available then the model predicts that the MAX 
rule will perform better than the SUM rule for group fusion, that 
SUM will be better than MAX for similarity fusion, and that the 
former type of fusion is generally to be preferred. These predictions 
are fully in accord with the many previous empirical studies [24], 
hence validating the model and providing a rationale for why data 
fusion can indeed enhance the effectiveness of similarity searching. 
However, the model's complexity and the volume of training 
information that it requires means that it is most unlikely that it could 
be used, as was originally the hope, to predict the utility of new types 
of fusion rule. 

Drawing on work carried out by Spoerri on the use of data fusion 
to combine the outputs of text search engines [56], Holliday et aL 
have reported a systematic study of the use of multiple rankings for 
similarity-based virtual screening [55]. Their experiments used two 
standard test databases, the MDDR and WOMBAT databases [57], 
and similarity searches with five different similarity coefficients and 
five different types of fingerprint, i.e., a total of 25 different similarity 
measures. A similarity search was carried out for a bioactive reference 
structure using one of these measures and a note taken of the number 
of top-ranked database structures that had the same bioactivity as the 
reference structure (specifically, a database structure was assumed to 
have been retrieved in a screening search if it occurred in the top-I% 
of the ranking after the database had been ranked in order of 
decreasing similarity with the reference structure). This procedure was 
repeated for each of the other 24 similarity measures, so that it was 
possible to determine how many database structures were retrieved by 
just one measure, by just two measures, by just three measures etc. It 
was found that very many structures were retrieved in the top-I% of a 
single search but that the numbers of retrieved structures fell away 
very rapidly as one considered the top-I% of two searches, of three 
searches, of four searches etc. This behaviour was observed 
consistently across all the types of bioactivity that were searched for, 
suggesting that this is an entirely general phenomenon. Indeed, 
Holliday et at were able to demonstrate and to rationalise the 
existence of a power law relationship [58, 59] between the numbers of 
structures retrieved and the numbers of searches. Since there are 
decreasingly few structures common to increasing numbers of 
rankings, then data fusion will be effective when many of these 
common structures have the same bioactivity as the reference 
structure. Holliday et aL showed that only a small proportion of the 
many structures retrieved by a single search were active, but that this 
proportion increased rapidly as one considered the structures retrieved 
by two searches, the structures retrieved by three searches etc. The 
probability of activity of a database structure hence increases in line 
with its frequency of retrieval in multiple similarity searches, thus 
providing a simple, but direct, empirical justification for using 
combination methods to enhance the effectiveness of virtual screening. 



Summary and outlook 

Similarity searching is one of the most widely used methods for 
ligand-based virtual screening. A range of different types of similarity 
measure are available for this purpose, and data fusion provides a 
simple way of combining the results from multiple similarity searches 
to increase the effectiveness of screening above that normally 
obtainable from the use of a single similarity measure. Two 
approaches to fusion have been described in the literature: similarity 
fusion involves matching a single reference structure against a database 
using multiple similarity measures; while group fusion involves 
matching multiple reference structures against a database using a single 
similarity measure. If multiple actives are available then the latter 
procedure is normally to be preferred. 

The fusion rules that have attracted most attention thus far are 
unsupervised, in the sense that they do not require any training data 
relating similarity scores to probabilities of activity; however the 
increasing availability of such structure-activity data means that 
supervised rules provide an obvious focus for future research in data 
fusion. Other areas where developments may be expected include the 
combination of different types of virtual screening method, the 
comparison of supervised fusion with existing screening approaches 
based on machine learning (which also requires the availability of 
extensive training data), and further attempts to provide a theoretical 
underpinning for the use of fusion methods. 
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